
Conversation

@manuelcandales (Contributor) commented Jan 13, 2026

This pull request builds on top of #16499 to introduce support for the Parakeet model in the Metal backend. The most important changes are grouped below:

Parakeet export/lowering:

  • Added support for Metal lowering to the Parakeet export/lowering script
  • Provided a custom linear decomposition, which achieves two goals: it avoids the call to addmm and avoids calling reinterpret_tensor_wrapper with a 0 stride (see the sketch below)
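
A minimal sketch of that decomposition, assuming only the stated goals; the helper name _linear_bias_decomposition appears in the diff below, but this body is an inference, not the PR's exact code:

import torch

def _linear_bias_decomposition(input, weight, bias=None):
    # Express linear as an explicit matmul + add: no addmm is emitted,
    # and bias is added as an ordinary tensor instead of being broadcast
    # through reinterpret_tensor_wrapper with a 0 stride.
    out = torch.matmul(input, weight.t())
    if bias is not None:
        out = out + bias
    return out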

Operator updates:

  • Added an implementation of aoti_torch_mps_bmm_out to support batched matrix multiplication (bmm) in the Metal backend
  • Fixed input-channel handling for grouped convolutions in aoti_torch_mps_convolution by reading the channel count from the correct dimension of the weight tensor (see the illustration below)
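
For context on the grouped-convolution fix: with groups > 1, a conv weight stores only in_channels // groups in its second dimension, so the true input channel count has to be recovered from the weight shape. A small PyTorch illustration (the layer sizes here are arbitrary):

import torch

groups = 4
conv = torch.nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3, groups=groups)

# Weight layout is [out_channels, in_channels // groups, kH, kW],
# so dimension 1 holds 2 here, not the full 8 input channels.
print(conv.weight.shape)              # torch.Size([16, 2, 3, 3])
print(conv.weight.shape[1] * groups)  # 8, the actual input channel count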

Shim layer updates:

  • Added an implementation for aoti_torch_new_tensor_handle
  • Enabled non-zero tensor storage offsets in aoti_torch__reinterpret_tensor by adjusting the data pointer instead of rejecting non-zero offsets, updating the memory tracking and Metal buffer mapping logic accordingly (see the sketch after this list)
  • Added the metal_buffer_nocopy function to map arbitrary memory pointers into Metal buffers, supporting cases where the data pointer is offset
  • Improved error messages in several stubbed shim functions by including the function name in the exception message for easier debugging
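
To illustrate the storage-offset arithmetic from the PyTorch side (a hedged sketch of the idea, not the shim code itself): a view with a non-zero storage_offset shares its base tensor's storage, and its data pointer sits storage_offset * element_size bytes past the base pointer, which is the adjustment the shim now applies instead of rejecting the offset:

import torch

base = torch.arange(12, dtype=torch.float32)
view = base.as_strided(size=(2, 2), stride=(2, 1), storage_offset=4)

# The view's data pointer is offset from the base allocation by
# storage_offset * element_size bytes (4 * 4 = 16 here).
assert view.data_ptr() == base.data_ptr() + 4 * base.element_size()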

Copilot AI review requested due to automatic review settings January 13, 2026 16:56
@pytorch-bot bot commented Jan 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16562

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit 69398ff with merge base dbf3c37:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Jan 13, 2026
@manuelcandales added the release notes: none label Jan 13, 2026
Copilot AI review requested due to automatic review settings January 13, 2026 17:23
Copilot AI left a comment

Pull request overview

This PR enables support for the Parakeet model in the Metal backend by implementing the necessary operators, fixing existing issues, and adding export/lowering infrastructure. The changes build upon existing Metal backend functionality to support this specific ASR model.

Changes:

  • Added Metal backend support to Parakeet export script with custom linear decomposition to avoid addmm and handle tensor reinterpretation
  • Implemented aoti_torch_mps_bmm_out for batched matrix multiplication and fixed grouped convolution input channel handling
  • Enhanced shim layer to support non-zero tensor storage offsets and new tensor handle creation

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • examples/models/parakeet/export_parakeet_tdt.py: Added Metal backend support with custom decomposition and dtype handling
  • examples/models/parakeet/README.md: Updated documentation for Metal export and runner usage
  • backends/apple/metal/runtime/shims/memory.cpp: Implemented storage offset support and new tensor handle creation
  • backends/apple/metal/runtime/shims/et_metal_ops.mm: Added bmm_out implementation and fixed grouped convolution
  • backends/apple/metal/runtime/shims/et_metal_ops.h: Added bmm_out function declaration
  • backends/apple/metal/runtime/shims/et_metal.mm: Added metal_buffer_nocopy function
  • backends/apple/metal/runtime/shims/et_metal.h: Added metal_buffer_nocopy declaration
  • backends/apple/metal/metal_backend.py: Updated supported fallback kernels list
  • backends/aoti/common_shims.cpp: Improved error messages for stubbed functions


Comment on lines +737 to +742
if (self_strides[2] != 1 || self_strides[1] != K || self_strides[0] != M * K) {
ET_LOG(Error, "aoti_torch_mps_bmm_out: self tensor must be contiguous. "
"Only dense row-major layout supported; transposed/view tensors are unsupported. "
"Expected strides=[%lld,%lld,1] for shape=[%lld,%lld,%lld], got strides=[%d,%d,%d].",
(long long)(M * K), (long long)K, (long long)B, (long long)M, (long long)K,
self_strides[0], self_strides[1], self_strides[2]);
Copilot AI commented Jan 13, 2026

The format specifiers for stride values are incorrect. The strides are int64_t but are being printed with %d instead of %lld. This will cause truncation on 64-bit values and incorrect error messages. The same issue exists for mat2_strides and out_strides error messages.

Comment on lines +747 to +752
if (mat2_strides[2] != 1 || mat2_strides[1] != N || mat2_strides[0] != K * N) {
ET_LOG(Error, "aoti_torch_mps_bmm_out: mat2 tensor must be contiguous. "
"Only dense row-major layout supported; transposed/view tensors are unsupported. "
"Expected strides=[%lld,%lld,1] for shape=[%lld,%lld,%lld], got strides=[%d,%d,%d].",
(long long)(K * N), (long long)N, (long long)B, (long long)K, (long long)N,
mat2_strides[0], mat2_strides[1], mat2_strides[2]);
Copilot AI commented Jan 13, 2026

The format specifiers for stride values should be %lld instead of %d. Strides are int64_t values, not int32_t.

Comment on lines +757 to +762
if (out_strides[2] != 1 || out_strides[1] != N || out_strides[0] != M * N) {
ET_LOG(Error, "aoti_torch_mps_bmm_out: out tensor must be contiguous. "
"Only dense row-major layout supported; transposed/view tensors are unsupported. "
"Expected strides=[%lld,%lld,1] for shape=[%lld,%lld,%lld], got strides=[%d,%d,%d].",
(long long)(M * N), (long long)N, (long long)B, (long long)M, (long long)N,
out_strides[0], out_strides[1], out_strides[2]);
Copilot AI commented Jan 13, 2026

The format specifiers for stride values should be %lld instead of %d. Strides are int64_t values, not int32_t.

-     : memory_to_n_tensor[data_ptr] + 1;
+     : memory_to_n_tensor[adjusted_data] + 1;

ET_LOG(Debug, "aoti_torch__reinterpret_tensor: successfull");
Copilot AI commented Jan 13, 2026

Corrected spelling of 'successfull' to 'successful'.

Suggested change
ET_LOG(Debug, "aoti_torch__reinterpret_tensor: successfull");
ET_LOG(Debug, "aoti_torch__reinterpret_tensor: successful");

? NOT_OWN
: memory_to_n_tensor[data_ptr] + 1;

ET_LOG(Debug, "aoti_torch_new_tensor_handle: successfull");
Copilot AI commented Jan 13, 2026

Corrected spelling of 'successfull' to 'successful'.

Suggested change
ET_LOG(Debug, "aoti_torch_new_tensor_handle: successfull");
ET_LOG(Debug, "aoti_torch_new_tensor_handle: successful");

dtype_to_element_size(dtype),
adjusted_data);

metal_buffer_nocopy(adjusted_data, tensor->nbytes(), true);
Copilot AI commented Jan 13, 2026

Use-after-free risk due to incorrect lifetime management of reinterpreted tensors with non-zero storage_offset. In aoti_torch__reinterpret_tensor, adjusted_data is computed from the original tensor's data_ptr (line 471) and a new tensor view is created on that interior pointer (lines 483–488). However, the reference counting and buffer mapping treat adjusted_data as an independent allocation: metal_buffer_nocopy wraps adjusted_data in a new MTLBuffer via newBufferWithBytesNoCopy and registers it in ptr_to_mtl_buffer, and memory_to_n_tensor[adjusted_data] is incremented (lines 512–517) rather than sharing the original data_ptr's entry. When the original tensor is deleted first, aoti_torch_delete_tensor_object sees data_ptr with refcount 1 and frees the underlying memory (via metal_deallocate_buffer for MPS buffers or free for CPU buffers), while the reinterpreted tensor still exists and its MTLBuffer (and mutable_data_ptr() pointing at adjusted_data) continue to be used by Metal ops via get_mtl_buffer / ETMetalKernelFunction::setArg. This leaves adjusted_data and its MTLBuffer pointing at freed memory, so subsequent Metal graph executions or kernel launches can read or write freed memory, enabling memory corruption and potential process compromise whenever models or AOT runtimes use _reinterpret_tensor with non-zero storage offsets.

Severity: HIGH. Confidence: 8

return partitioner, programs


def _linear_bias_decomposition(input, weight, bias=None):
Contributor commented:

We naturally decompose this; if it's not being decomposed, it's because the aoti backend probably specifies not to decompose it. So we should probably just disable that for Metal.

@autoreleasepool {
try {
// Convert AOTITensorHandle to ExecutorTorch tensors
auto out_tensor = reinterpret_cast<Tensor*>(out);
Contributor commented:

Nit: It would be useful to wrap this cast in a helper function since we plan on changing the tensor definition backing the shim.


// Create cache key for this batched matrix multiplication
// Cache key includes: op_name, shape params {B, M, K, N}, dtype, transpose_flag
// This allows reuse when same BMM shape/dtype is called repeatedly
Contributor commented:

Does this happen often with transformers? I would think the shape mostly changes each iteration with growing context length

(void)device; // Used for validation, consistent with other ops

// Get Metal buffers for input and output tensors
id<MTLBuffer> self_buffer = get_mtl_buffer(self_tensor, "aoti_torch_mps_bmm_out", "self");
Contributor commented:

Just sort of curious: why do you pass the fn name and arg name? Are these keys to a dict?

@manuelcandales (Contributor, Author) replied:

just for logging

ET_LOG(Error, "aoti_torch_mps_bmm_out: Failed to get Metal device");
return Error::Internal;
}
(void)device; // Used for validation, consistent with other ops
Contributor commented:

What's going on here?


// Validate tensor dimensions - bmm requires 3-D tensors
if (self_tensor->dim() != 3 || mat2_tensor->dim() != 3 || out_tensor->dim() != 3) {
ET_LOG(Error, "aoti_torch_mps_bmm_out: tensors must be 3-D. "
Contributor commented:

If the failure is that the number of dims is wrong, why is it useful to print the shape? Especially since dims could be too big?

@mergennachin (Contributor) left a comment

See inline comments

cmake --workflow --preset llm-release

# For Metal (macOS)
cmake --workflow --preset llm-debug-metal
Contributor commented:

Let's switch back to llm-release-metal in the instructions for consistency

length:nbytes
options:MTLResourceCPUCacheModeWriteCombined | MTLResourceStorageModeShared
deallocator:nil];

Contributor commented:

Check subBuffer for nullity/falsiness.

Comment on lines +501 to +517
ET_LOG(
Debug,
"aoti_torch__reinterpret_tensor: Adjusted original_data=%p, storage_offset=%lld, element_size=%zu, adjusted_data=%p",
data_ptr,
storage_offset,
dtype_to_element_size(dtype),
adjusted_data);

metal_buffer_nocopy(adjusted_data, tensor->nbytes(), true);
}

// Increment the reference count for this memory address only if it is owned
// by tensor
- memory_to_n_tensor[data_ptr] = memory_to_n_tensor[data_ptr] == NOT_OWN
+ memory_to_n_tensor[adjusted_data] =
+     memory_to_n_tensor[adjusted_data] == NOT_OWN
      ? NOT_OWN
-     : memory_to_n_tensor[data_ptr] + 1;
+     : memory_to_n_tensor[adjusted_data] + 1;
Contributor commented:

Copilot's review looks legit:

if (adjusted_data != data_ptr) {
    ET_LOG(Debug, "...");
    metal_buffer_nocopy(adjusted_data, tensor->nbytes(), true);
    memory_to_n_tensor[adjusted_data] = NOT_OWN;
}
if (memory_to_n_tensor[data_ptr] != NOT_OWN) {
    memory_to_n_tensor[data_ptr] += 1;
}

- (void)new_handle;
- throw std::runtime_error("Not implemented");
- return Error::Internal;
+ ET_LOG(Debug, "aoti_torch_new_tensor_handle: entered");
Contributor commented:

Should you handle both zero and non-zero offset in this method?

@manuelcandales (Contributor, Author) replied:

what do you mean? aoti_torch_new_tensor_handle doesn't take an offset


Labels

  • CLA Signed
  • release notes: none


4 participants